{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "97c79c38-38a3-40f3-ba2e-250649347d63",
   "metadata": {},
   "source": [
    "# Multimodal Parsing using GPT4o-mini\n",
    "\n",
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_parse/blob/main/examples/multimodal/gpt4o_mini.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>\n",
    "\n",
    "This cookbook shows you how to use LlamaParse to parse any document with the multimodal capabilities of GPT4o-mini.\n",
    "\n",
    "LlamaParse allows you to plug in external, multimodal model vendors for parsing - we handle the error correction, validation, and scalability/reliability for you.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15e60ecf-519c-41fc-911b-765adaf8bad4",
   "metadata": {},
   "source": [
    "## Setup\n",
    "\n",
    "Download the data - the blog post from Meta on Llama3.1, in PDF form."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "91a9e532-1454-40e0-bbf0-fd442c350121",
   "metadata": {},
   "outputs": [],
   "source": [
    "import nest_asyncio\n",
    "\n",
    "nest_asyncio.apply()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d9fb0aa-74cd-476f-8161-efd9e04248bf",
   "metadata": {},
   "outputs": [],
   "source": [
    "!wget \"https://www.dropbox.com/scl/fi/8iu23epvv3473im5rq19g/llama3.1_blog.pdf?rlkey=5u417tbdox4aip33fdubvni56&st=dzozd11e&dl=1\" -O \"data/llama3.1_blog.pdf\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c70d420d-1778-4b0d-81e2-db09276e90cf",
   "metadata": {},
   "source": [
    "![llama_blog_img](llama3.1-p5.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4e29a9d7-5bd9-4fb8-8ec1-4c128a748662",
   "metadata": {},
   "source": [
    "## Initialize LlamaParse\n",
    "\n",
    "Initialize LlamaParse in multimodal mode, and specify the vendor.\n",
    "\n",
    "**NOTE**: optionally you can specify the OpenAI API key. If you do so you will be charged our base LlamaParse price of 0.3c per page. If you don't then you will be charged 1.5c per page, as we will make the calls to gpt4o-mini for you and give you price predictability."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dc921729-3446-42ca-8e1b-a6fd26195ed9",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.schema import TextNode\n",
    "from typing import List\n",
    "import json\n",
    "\n",
    "\n",
    "def get_text_nodes(json_list: List[dict]):\n",
    "    text_nodes = []\n",
    "    for idx, page in enumerate(json_list):\n",
    "        text_node = TextNode(text=page[\"md\"], metadata={\"page\": page[\"page\"]})\n",
    "        text_nodes.append(text_node)\n",
    "    return text_nodes\n",
    "\n",
    "\n",
    "def save_jsonl(data_list, filename):\n",
    "    \"\"\"Save a list of dictionaries as JSON Lines.\"\"\"\n",
    "    with open(filename, \"w\") as file:\n",
    "        for item in data_list:\n",
    "            json.dump(item, file)\n",
    "            file.write(\"\\n\")\n",
    "\n",
    "\n",
    "def load_jsonl(filename):\n",
    "    \"\"\"Load a list of dictionaries from JSON Lines.\"\"\"\n",
    "    data_list = []\n",
    "    with open(filename, \"r\") as file:\n",
    "        for line in file:\n",
    "            data_list.append(json.loads(line))\n",
    "    return data_list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f2e9d9cf-8189-4fcb-b34f-cde6cc0b59c8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Started parsing the file under job_id bf3e7341-bb11-42d4-a5f7-bb5260ad792c\n"
     ]
    }
   ],
   "source": [
    "from llama_parse import LlamaParse\n",
    "\n",
    "parser = LlamaParse(\n",
    "    result_type=\"markdown\",\n",
    "    use_vendor_multimodal_model=True,\n",
    "    vendor_multimodal_model_name=\"openai-gpt-4o-mini\",\n",
    "    invalidate_cache=True,\n",
    ")\n",
    "json_objs = parser.get_json_result(\"./data/llama3.1_blog.pdf\")\n",
    "json_list = json_objs[0][\"pages\"]\n",
    "docs = get_text_nodes(json_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "96a81df0-1026-4e30-a930-f677dc31e344",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: Save\n",
    "save_jsonl([d.dict() for d in docs], \"docs.jsonl\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ee2e6920-8893-4b39-ae12-94d13c651406",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: Load\n",
    "from llama_index.core import Document\n",
    "\n",
    "docs_dicts = load_jsonl(\"docs.jsonl\")\n",
    "docs = [Document.parse_obj(d) for d in docs_dicts]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4f3c51b0-7878-48d7-9bc3-02b516500128",
   "metadata": {},
   "source": [
    "### Setup GPT-4o baseline\n",
    "\n",
    "For comparison, we will also parse the document using GPT-4o (3c per page)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6fc3f258-50ae-4988-b904-c105463a498f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Started parsing the file under job_id 391ff280-08e5-4143-85f2-90ada287e26c\n"
     ]
    }
   ],
   "source": [
    "from llama_parse import LlamaParse\n",
    "\n",
    "parser_gpt4o = LlamaParse(\n",
    "    result_type=\"markdown\",\n",
    "    use_vendor_multimodal_model=True,\n",
    "    vendor_multimodal_model=\"openai-gpt4o\",\n",
    "    # invalidate_cache=True\n",
    ")\n",
    "json_objs_gpt4o = parser_gpt4o.get_json_result(\"./data/llama3.1_blog.pdf\")\n",
    "# json_objs_gpt4o = parser.get_json_result(\"./data/llama2-p33.pdf\")\n",
    "json_list_gpt4o = json_objs_gpt4o[0][\"pages\"]\n",
    "docs_gpt4o = get_text_nodes(json_list_gpt4o)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6a47f04e-12e1-4c80-a71d-ef7721f96401",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: Save\n",
    "save_jsonl([d.dict() for d in docs_gpt4o], \"docs_gpt4o.jsonl\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c38b5ca3-fa87-434b-b477-bf6a4962eb3d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Optional: Load\n",
    "from llama_index.core import Document\n",
    "\n",
    "docs_gpt4o_dicts = load_jsonl(\"docs_gpt4o.jsonl\")\n",
    "docs_gpt4o = [Document.parse_obj(d) for d in docs_gpt4o_dicts]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "44c20f7a-2901-4dd0-b635-a4b33c5664c1",
   "metadata": {},
   "source": [
    "## View Results\n",
    "\n",
    "Let's visualize the results between GPT-4o-mini and GPT-4o along with the original document page.\n",
    "\n",
    "We see that \n",
    "\n",
    "**NOTE**: If you're using llama2-p33, just use `docs[0]`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "778698aa-da7e-4081-b3b5-0372f228536f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "page: 5\n",
      "\n",
      "# Llama 3.1 Model Evaluation\n",
      "\n",
      "## Category Benchmark\n",
      "\n",
      "| Benchmark                     | Gemma 2 9B IT | Mistral 7B Instruct | Llama 3.1 70B | Mistral 8x228B Instruct | GPT 3.5 Turbo |\n",
      "|-------------------------------|----------------|----------------------|----------------|-------------------------|----------------|\n",
      "| General                       |                |                      |                |                         |                |\n",
      "| MMLU (0-shot, CoT)           | 73.0           | 72.3                 | 86.0           | 79.9                    | 69.8           |\n",
      "| MMLU PRO (5-shot, CoT)       | 48.3           | 36.9                 | 66.4           | 56.3                    | 49.2           |\n",
      "| IFEval                        | 80.4           | 73.6                 | 87.5           | 72.7                    | 69.9           |\n",
      "| Code                          |                |                      |                |                         |                |\n",
      "| HumanEval (0-shot)           | 72.6           | 54.3                 | 80.5           | 75.6                    | 68.0           |\n",
      "| MBPP EvalPlus (Human) (0-shot, CoT) | 72.8   | 71.7                 | 86.0           | 78.6                    | 82.0           |\n",
      "| Math                          |                |                      |                |                         |                |\n",
      "| GSM8K                         | 84.5           | 76.7                 | 95.1           | 88.2                    | 81.6           |\n",
      "| MATH (0-shot, CoT)           | 51.9           | 44.3                 | 70.8           | 54.1                    | 43.1           |\n",
      "| Reasoning                    |                |                      |                |                         |                |\n",
      "| ARC Challenge                 | 83.4           | 87.6                 | 74.2           | 87.7                    | 83.7           |\n",
      "| GPA (0-shot)                 | 32.8           | 24.8                 | 46.7           | 33.3                    | 35.8           |\n",
      "| Tool use                      |                |                      |                |                         |                |\n",
      "| BFCL                          | 76.1           | 64.0                 | 94.8           | 81.4                    | 78.0           |\n",
      "| Noxus                         | 38.5           | 30.0                 | 24.7           | 48.5                    | 37.5           |\n",
      "| Long context                  |                |                      |                |                         |                |\n",
      "| ZeroSCROLLS/QualiTY          | 81.0           | -                    | 90.5           | -                       | -              |\n",
      "| InfiniteBench/En.MC          | 65.1           | -                    | 78.2           | -                       | -              |\n",
      "| NHI/Multi-needle              | 98.8           | -                    | 97.5           | -                       | -              |\n",
      "| Multilingual                  |                |                      |                |                         |                |\n",
      "| MGSM (0-shot)                | 68.9           | 53.2                 | 86.9           | 71.1                    | 51.4           |\n",
      "\n",
      "## Llama 3.1 405B Human Evaluation\n",
      "\n",
      "| Comparison                                   | Win Rate | Tie Rate | Loss Rate |\n",
      "|----------------------------------------------|----------|----------|-----------|\n",
      "| Llama 3.1 405B vs GPT-4-0125-Preview        | 23.3%    | 52.2%    | 24.5%     |\n",
      "| Llama 3.1 405B vs GPT-4o                     | 19.1%    | 51.7%    | 29.2%     |\n",
      "| Llama 3.1 405B vs Claude 3.5 Sonnet         | 24.9%    | 50.8%    | 24.2%     |\n"
     ]
    }
   ],
   "source": [
    "# using GPT4o-mini\n",
    "print(docs[4].get_content(metadata_mode=\"all\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1511a30f-3efc-4142-9668-7dc056a24d0c",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "page: 5\n",
      "\n",
      "# Introducing Llama 3.1: Our most capable models to date\n",
      "\n",
      "## Meta\n",
      "\n",
      "| Category | Benchmark | Llama 3.1 8B | Gemma 2 9B IT | Mistral 7B Instruct | Llama 3.1 70B | Mixtral 8x22B Instruct | GPT 3.5 Turbo |\n",
      "|----------|-----------|--------------|---------------|---------------------|---------------|-----------------------|---------------|\n",
      "| General  | MMLU (0-shot, CoT) | 73.0 | 72.3 (0-shot, non-CoT) | 60.5 | 86.0 | 79.9 | 69.8 |\n",
      "|          | MMLU PRO (5-shot, CoT) | 48.3 | 71.7 | 36.9 | 66.4 | 56.3 | 49.2 |\n",
      "|          | ITEval | 80.4 | 73.6 | 57.6 | 87.5 | 72.7 | 69.9 |\n",
      "| Code     | HumanEval (0-shot) | 72.6 | 54.3 | 40.2 | 80.5 | 75.6 | 68.0 |\n",
      "|          | MBPP EvalPlus (5-shot) (0-shot) | 72.8 | 71.7 | 49.5 | 86.0 | 78.6 | 82.0 |\n",
      "| Math     | GSM8K | 84.5 | 76.7 | 53.2 | 95.1 | 88.2 | 81.6 |\n",
      "|          | MATH (0-shot, CoT) | 51.9 | 44.3 | 13.0 | 68.0 | 54.1 | 43.1 |\n",
      "| Reasoning | ARC Challenge (0-shot) | 83.4 | 87.6 | 74.2 | 94.8 | 88.7 | 83.7 |\n",
      "|          | GOPA (0-shot) | 32.8 | 40.8 | 28.0 | 46.7 | - | - |\n",
      "| Tool use | BFCL | 76.1 | 60.3 | 60.4 | 94.8 | - | 85.9 |\n",
      "|          | Noxus | 38.5 | 30.0 | 24.7 | 56.7 | 48.5 | 37.2 |\n",
      "| Long context | ZeroSCROLLS/QuaLITY | 81.0 | - | - | 90.5 | - | - |\n",
      "|          | InfiniteBench/En.MC | 65.1 | - | - | 78.2 | - | - |\n",
      "|          | NIH/Multi-needle | 98.8 | - | - | 97.5 | - | - |\n",
      "| Multilingual | Multilingual MGSM (0-shot) | 68.9 | 53.2 | 29.9 | 86.9 | 71.1 | 51.4 |\n",
      "\n",
      "## Llama 3.1 405B Human Evaluation\n",
      "\n",
      "| Model Comparison | Win | Tie | Loss |\n",
      "|------------------|-----|-----|------|\n",
      "| Llama 3.1 405B vs GPT-4-0125-Preview | 23.3% | 52.2% | 24.5% |\n",
      "| Llama 3.1 405B vs GPT-4o | 19.1% | 51.7% | 29.2% |\n",
      "| Llama 3.1 405B vs Claude 3.5 Sonnet | 24.9% | 50.8% | 24.2% |\n",
      "\n",
      "https://ai.meta.com/blog/meta-llama-3-1/\n"
     ]
    }
   ],
   "source": [
    "# using GPT-4o\n",
    "print(docs_gpt4o[4].get_content(metadata_mode=\"all\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "705f7729-fa0f-4ca0-8562-c42afeaa8532",
   "metadata": {},
   "source": [
    "## Setup RAG Pipeline\n",
    "\n",
    "Let's setup a RAG pipeline over this data.\n",
    "\n",
    "(we also use gpt4o-mini for the actual text synthesis step)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5a53ee5d-cc63-421b-8896-588c83edfcf0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core import Settings\n",
    "from llama_index.llms.openai import OpenAI\n",
    "from llama_index.embeddings.openai import OpenAIEmbedding\n",
    "\n",
    "Settings.llm = OpenAI(model=\"gpt-4o-mini\")\n",
    "Settings.embed_model = OpenAIEmbedding(model=\"text-embedding-3-large\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60972d7a-7948-4ad7-89df-57004acee917",
   "metadata": {},
   "outputs": [],
   "source": [
    "# from llama_index.core import SummaryIndex\n",
    "from llama_index.core import VectorStoreIndex\n",
    "from llama_index.llms.openai import OpenAI\n",
    "\n",
    "index = VectorStoreIndex(docs)\n",
    "query_engine = index.as_query_engine(similarity_top_k=5)\n",
    "\n",
    "index_gpt4o = VectorStoreIndex(docs_gpt4o)\n",
    "query_engine_gpt4o = index_gpt4o.as_query_engine(similarity_top_k=5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7df7bcb-1df4-4a01-88fc-2d596b1cc74d",
   "metadata": {},
   "outputs": [],
   "source": [
    "query = \"How does Llama3.1 compare against gpt-4o and Claude 3.5 Sonnet in human evals?\"\n",
    "\n",
    "response = query_engine.query(query)\n",
    "response_gpt4o = query_engine_gpt4o.query(query)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b7070a31-3bb8-4134-8338-20bc2fd6f3d6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "In human evaluations, Llama 3.1 405B has a win rate of 19.1% against GPT-4o and 24.9% against Claude 3.5 Sonnet. The tie rates for Llama 3.1 405B are 51.7% against GPT-4o and 50.8% against Claude 3.5 Sonnet, while the loss rates are 29.2% against GPT-4o and 24.2% against Claude 3.5 Sonnet. This indicates that Llama 3.1 performs competitively in comparison to both models, with a notable number of ties.\n"
     ]
    }
   ],
   "source": [
    "print(response)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7bee8167-f021-4c87-8d28-9f40a4f7b69d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# Llama 3.1 Model Evaluation\n",
      "\n",
      "## Category Benchmark\n",
      "\n",
      "| Benchmark                     | Gemma 2 9B IT | Mistral 7B Instruct | Llama 3.1 70B | Mistral 8x228B Instruct | GPT 3.5 Turbo |\n",
      "|-------------------------------|----------------|----------------------|----------------|-------------------------|----------------|\n",
      "| General                       |                |                      |                |                         |                |\n",
      "| MMLU (0-shot, CoT)           | 73.0           | 72.3                 | 86.0           | 79.9                    | 69.8           |\n",
      "| MMLU PRO (5-shot, CoT)       | 48.3           | 36.9                 | 66.4           | 56.3                    | 49.2           |\n",
      "| IFEval                        | 80.4           | 73.6                 | 87.5           | 72.7                    | 69.9           |\n",
      "| Code                          |                |                      |                |                         |                |\n",
      "| HumanEval (0-shot)           | 72.6           | 54.3                 | 80.5           | 75.6                    | 68.0           |\n",
      "| MBPP EvalPlus (Human) (0-shot, CoT) | 72.8   | 71.7                 | 86.0           | 78.6                    | 82.0           |\n",
      "| Math                          |                |                      |                |                         |                |\n",
      "| GSM8K                         | 84.5           | 76.7                 | 95.1           | 88.2                    | 81.6           |\n",
      "| MATH (0-shot, CoT)           | 51.9           | 44.3                 | 70.8           | 54.1                    | 43.1           |\n",
      "| Reasoning                    |                |                      |                |                         |                |\n",
      "| ARC Challenge                 | 83.4           | 87.6                 | 74.2           | 87.7                    | 83.7           |\n",
      "| GPA (0-shot)                 | 32.8           | 24.8                 | 46.7           | 33.3                    | 35.8           |\n",
      "| Tool use                      |                |                      |                |                         |                |\n",
      "| BFCL                          | 76.1           | 64.0                 | 94.8           | 81.4                    | 78.0           |\n",
      "| Noxus                         | 38.5           | 30.0                 | 24.7           | 48.5                    | 37.5           |\n",
      "| Long context                  |                |                      |                |                         |                |\n",
      "| ZeroSCROLLS/QualiTY          | 81.0           | -                    | 90.5           | -                       | -              |\n",
      "| InfiniteBench/En.MC          | 65.1           | -                    | 78.2           | -                       | -              |\n",
      "| NHI/Multi-needle              | 98.8           | -                    | 97.5           | -                       | -              |\n",
      "| Multilingual                  |                |                      |                |                         |                |\n",
      "| MGSM (0-shot)                | 68.9           | 53.2                 | 86.9           | 71.1                    | 51.4           |\n",
      "\n",
      "## Llama 3.1 405B Human Evaluation\n",
      "\n",
      "| Comparison                                   | Win Rate | Tie Rate | Loss Rate |\n",
      "|----------------------------------------------|----------|----------|-----------|\n",
      "| Llama 3.1 405B vs GPT-4-0125-Preview        | 23.3%    | 52.2%    | 24.5%     |\n",
      "| Llama 3.1 405B vs GPT-4o                     | 19.1%    | 51.7%    | 29.2%     |\n",
      "| Llama 3.1 405B vs Claude 3.5 Sonnet         | 24.9%    | 50.8%    | 24.2%     |\n"
     ]
    }
   ],
   "source": [
    "print(response.source_nodes[1].get_content())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f9fef7f-510b-46a5-8716-f5616f542035",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "In human evaluations, Llama 3.1 405B shows competitive performance against GPT-4o and Claude 3.5 Sonnet. Specifically, when compared to GPT-4o, Llama 3.1 won 19.1% of the time, tied 51.7%, and lost 29.2%. Against Claude 3.5 Sonnet, it won 24.9% of the time, tied 50.8%, and lost 24.2%. This indicates that Llama 3.1 performs comparably in real-world scenarios against these leading models.\n"
     ]
    }
   ],
   "source": [
    "print(response_gpt4o)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d40f9dd4-2dd4-4fa5-b636-1f901dc1601b",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "# Introducing Llama 3.1: Our most capable models to date\n",
      "\n",
      "## Meta\n",
      "\n",
      "| Category | Benchmark | Llama 3.1 8B | Gemma 2 9B IT | Mistral 7B Instruct | Llama 3.1 70B | Mixtral 8x22B Instruct | GPT 3.5 Turbo |\n",
      "|----------|-----------|--------------|---------------|---------------------|---------------|-----------------------|---------------|\n",
      "| General  | MMLU (0-shot, CoT) | 73.0 | 72.3 (0-shot, non-CoT) | 60.5 | 86.0 | 79.9 | 69.8 |\n",
      "|          | MMLU PRO (5-shot, CoT) | 48.3 | 71.7 | 36.9 | 66.4 | 56.3 | 49.2 |\n",
      "|          | ITEval | 80.4 | 73.6 | 57.6 | 87.5 | 72.7 | 69.9 |\n",
      "| Code     | HumanEval (0-shot) | 72.6 | 54.3 | 40.2 | 80.5 | 75.6 | 68.0 |\n",
      "|          | MBPP EvalPlus (5-shot) (0-shot) | 72.8 | 71.7 | 49.5 | 86.0 | 78.6 | 82.0 |\n",
      "| Math     | GSM8K | 84.5 | 76.7 | 53.2 | 95.1 | 88.2 | 81.6 |\n",
      "|          | MATH (0-shot, CoT) | 51.9 | 44.3 | 13.0 | 68.0 | 54.1 | 43.1 |\n",
      "| Reasoning | ARC Challenge (0-shot) | 83.4 | 87.6 | 74.2 | 94.8 | 88.7 | 83.7 |\n",
      "|          | GOPA (0-shot) | 32.8 | 40.8 | 28.0 | 46.7 | - | - |\n",
      "| Tool use | BFCL | 76.1 | 60.3 | 60.4 | 94.8 | - | 85.9 |\n",
      "|          | Noxus | 38.5 | 30.0 | 24.7 | 56.7 | 48.5 | 37.2 |\n",
      "| Long context | ZeroSCROLLS/QuaLITY | 81.0 | - | - | 90.5 | - | - |\n",
      "|          | InfiniteBench/En.MC | 65.1 | - | - | 78.2 | - | - |\n",
      "|          | NIH/Multi-needle | 98.8 | - | - | 97.5 | - | - |\n",
      "| Multilingual | Multilingual MGSM (0-shot) | 68.9 | 53.2 | 29.9 | 86.9 | 71.1 | 51.4 |\n",
      "\n",
      "## Llama 3.1 405B Human Evaluation\n",
      "\n",
      "| Model Comparison | Win | Tie | Loss |\n",
      "|------------------|-----|-----|------|\n",
      "| Llama 3.1 405B vs GPT-4-0125-Preview | 23.3% | 52.2% | 24.5% |\n",
      "| Llama 3.1 405B vs GPT-4o | 19.1% | 51.7% | 29.2% |\n",
      "| Llama 3.1 405B vs Claude 3.5 Sonnet | 24.9% | 50.8% | 24.2% |\n",
      "\n",
      "https://ai.meta.com/blog/meta-llama-3-1/\n"
     ]
    }
   ],
   "source": [
    "print(response_gpt4o.source_nodes[1].get_content())"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama_parse",
   "language": "python",
   "name": "llama_parse"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
